1. Introduction to Open Data Science

This course is all about finding the joy in doing science together. It’s about learning to use open data sources with tools that are available to everyone.

https://github.com/pauliinaanttila/IODS-project


2. Regression and model validation

This week we really got started with data wrangling and analysis in RStudio. I hadn’t used RStudio before, so this has all been quite exciting and challenging. After a few hours of trial and error, I’m starting to learn the basic tricks of exploring and analyzing data in RStudio.

Description of the data set

students2014 <- read.csv(file = "~/Documents/GitHub/IODS-project/data/learning2014.csv", header = TRUE, sep = ",") # reading the CSV file learning2014 into R
str(students2014) # exploring the structure of the data, gives the names and types of variables
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(students2014) # exploring the dimensions of the data, contains 166 observations of 7 variables
## [1] 166   7

We are working with the students2014 dataset this week. The dataset contains 166 students who took part in the course Introduction to Social Statistics in fall 2014 and also attended the final exam. The variables collected in the dataset are gender, age, attitude, deep learning, strategic learning, surface learning and exam points. Attitude towards statistics and the details of learning approaches were gathered through a questionnaire.

Graphical overview of the data

pairs(students2014[-1], col = students2014$gender) # draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix

This scatter plot matrix shows all the possible scatter plots from the columns of the students2014 data frame, but it’s a bit tricky to interpret as such.

summary(students2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The dataset includes 110 females and 56 males. The mean age of the studied population is 25.5 years; the youngest participant was 17 years old and the oldest 55. The points for attitude vary between 1.4 and 5 with a mean of 3.14. The exam points vary between 7 and 33 with a mean of 22.7. The means for deep, strategic and surface learning are 3.7, 3.1 and 2.8, respectively.

library(GGally)
library(ggplot2)
p <- ggpairs(students2014, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p

The age of the studied population is not normally distributed: the mass of the distribution is concentrated on the left of the figure, i.e. the skewness is positive. Put more simply, the majority of the people in the studied population are in their twenties, but there are a few people who are noticeably older as well.

The distributions of attitude, strategic learning and surface learning are fairly close to normal. The distributions of deep learning and points are concentrated on the right of the figure, i.e. the skewness is negative. In the case of points, this means there are more students who scored well in the examination than students who did not do so well.

Male students seem to have slightly higher exam points than female students. There is a mild negative correlation between age and points, whereas attitude seems to correlate positively with exam points. Strategic learning appears advantageous in terms of exam points, while surface learning correlates negatively with points.

Analysis of the data with regression

Regression analysis is a statistical modelling method used to estimate the relationships among variables. The focus here is on the relationship between a dependent variable (points) and one or more independent variables (attitude, strategic learning and surface learning).

my_model <- lm(points ~ attitude + stra + surf, data = students2014) # a multiple regression model where points is the dependent variable and attitude, strategic learning and surface learning are explanatory variables
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

In this example, the multiple regression model includes points as the dependent variable and attitude, strategic learning and surface learning as explanatory variables. The only variable that has a statistically significant relationship with points is attitude. Its coefficient is 3.4 (p-value 1.93e-08), meaning that a one-point increase in attitude is associated with roughly 3.4 more exam points, the other variables held constant.

my_model2 <- lm(points ~ attitude, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

It seems that a better attitude towards statistics is associated with better results in the exam, as measured by points.
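To make the coefficients concrete, here is a quick back-of-the-envelope check of what the simple model predicts at a few attitude levels, using the intercept and slope printed above (hand calculation, not new model output):

```r
# Predicted exam points at a few attitude levels, from summary(my_model2) above
attitude <- c(2, 3, 4)
11.6372 + 3.5255 * attitude   # about 18.7, 22.2 and 25.7 points
```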

R-squared is a statistical measure of how close the data are to the fitted regression line. A model fits the data well when the differences between the observed values and the model’s predicted values are small and unbiased; in general, the higher the R-squared, the better the model fits the data. R-squared takes values between 0 and 1 (0-100 %).

The R-squared is rather low here (0.19), but that is understandable given the nature of the study: any field that attempts to predict human behavior typically sees R-squared values below 50 %, because humans are simply harder to predict than, say, physical processes.
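As a sanity check of the definition, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch using R’s built-in mtcars data (since the students2014 file lives on my own machine):

```r
# R-squared by hand: 1 - RSS/TSS, checked against summary()
fit <- lm(mpg ~ wt, data = mtcars)
rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2  <- 1 - rss / tss
all.equal(r2, summary(fit)$r.squared)           # TRUE
```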

Diagnostic plots

plot(my_model2, which = c(1,2,5))

A bit about diagnostic plots:

Diagnostic plots are used to test the validity of models.

1) Residuals vs Fitted values provides a method to explore residuals versus model predictions. With this method one can test the constant variance assumption. The constant variance assumption implies that the size of errors should not depend on the explanatory variables.

In this example, the assumption is reasonable (the residuals are spread reasonably randomly).

2) The QQ-plot of the residuals provides a method to explore the assumption that the errors of the model are normally distributed.

The QQ-plot in this example is quite reasonable: although the observations deviate from the straight line at both ends, the normality assumption holds well enough here.

3) The Residuals vs Leverage plot can help identify observations that have an unusually high impact on the model.

In this example, no single observation has markedly higher leverage than the rest.
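The same diagnostics can also be read off numerically: hatvalues() gives the leverage of each observation and cooks.distance() its overall influence. A small sketch on the built-in mtcars data (the 4/n cutoff is just a common rule of thumb):

```r
# Numeric counterparts of the diagnostic plots
fit <- lm(mpg ~ wt, data = mtcars)
lev <- hatvalues(fit)                     # leverage of each observation
head(sort(lev, decreasing = TRUE), 3)     # the three highest-leverage cars
cd <- cooks.distance(fit)                 # influence of each observation
which(cd > 4 / nrow(mtcars))              # rule-of-thumb cutoff for influence
```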


3. Logistic regression

Description of the data set

This week we are working with a joined dataset that combines two Portuguese student alcohol consumption data sets. The original data come from the UCI Machine Learning Repository under the name “Student performance data set” (donated Nov 2014) and concern student performance in secondary education at two Portuguese schools in two distinct subjects: mathematics and Portuguese language. The data were collected using school reports and questionnaires. In the data wrangling, we combined the two data sets (mathematics and Portuguese language).

The wrangled data set contains 382 observations of 35 variables. The variables include the following:

1 school - student’s school

2 sex - student’s sex

3 age - student’s age

4 address - student’s home address type

5 famsize - family size

6 Pstatus - parent’s cohabitation status

7 Medu - mother’s education

8 Fedu - father’s education

9 Mjob - mother’s job

10 Fjob - father’s job

11 reason - reason to choose this school

12 guardian - student’s guardian

13 traveltime - home to school travel time

14 studytime - weekly study time

15 failures - number of past class failures

16 schoolsup - extra educational support

17 famsup - family educational support

18 paid - extra paid classes within the course subject (Math or Portuguese)

19 activities - extra-curricular activities

20 nursery - attended nursery school

21 higher - wants to take higher education

22 internet - Internet access at home

23 romantic - with a romantic relationship

24 famrel - quality of family relationships

25 freetime - free time after school

26 goout - going out with friends

27 Dalc - workday alcohol consumption

28 Walc - weekend alcohol consumption

29 health - current health status

30 absences - number of school absences

31 G1 - first period grade

32 G2 - second period grade

33 G3 - final grade

34 alc_use - the average of alcohol consumption on weekdays and weekends

35 high_use - TRUE if alc_use is higher than 2
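The last two variables were created in the data wrangling step. Roughly, the derivation looks like this (toy data for illustration, in base R for brevity; the real script first joined the math and Portuguese sets):

```r
# Toy data standing in for the joined student data
toy <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2   # average of weekday and weekend use
toy$high_use <- toy$alc_use > 2             # TRUE when the average exceeds 2
toy
```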

alc <- read.csv(file = "~/Documents/GitHub/IODS-project/data/alc.csv", header = TRUE, sep = ",", dec = ".") # reading data into R
colnames(alc) # printing out the names of the variables
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
dim(alc) # there are 382 observations of 35 variables in the data
## [1] 382  35
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

Exploring the data set

My hypotheses:

1. Older students consume high amounts of alcohol more often than younger ones.

2. Those who have high use of alcohol tend to fail more often.

3. Those who have extra-curricular activities consume less alcohol.

4. Those who want to take higher education consume less alcohol.

1. Age and alcohol consumption:

library(ggplot2)
p1 <- ggplot(alc, aes(x = high_use, y = age))
p1 + geom_boxplot() + ylab("age")

t1 <- table(alc$high_use, alc$age) # crosstabulation of "high_use" and "age"
t1
##        
##         15 16 17 18 19 20 22
##   FALSE 63 79 64 54  7  1  0
##   TRUE  18 28 36 27  4  0  1

According to the boxplots and crosstabulation, it seems that older students consume high amounts of alcohol more often than the younger ones.
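The crosstabulation is easier to judge as column proportions, i.e. the share of high users within each age. Re-entering the counts from the table above:

```r
# Share of high users within each age group (counts from the crosstabulation above)
t1 <- matrix(c(63, 18, 79, 28, 64, 36, 54, 27, 7, 4, 1, 0, 0, 1),
             nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             age = c(15, 16, 17, 18, 19, 20, 22)))
round(prop.table(t1, margin = 2), 2)   # TRUE share rises from 0.22 at 15 to 0.36 at 17
```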

2. Alcohol consumption and failures:

t2 <- table(alc$high_use, alc$failures) # crosstabulation of "high_use" and "failures"
t2
##        
##           0   1   2   3
##   FALSE 244  12  10   2
##   TRUE   90  12   9   3
counts <- table(alc$high_use, alc$failures)
barplot(counts, main = "High alcohol use and failures",
        xlab = "Failures",
        legend = rownames(counts), beside = TRUE)

The bar plot shows the number of failures on the x-axis and the count of students on the y-axis, grouped by high_use (TRUE or FALSE).

Only a small proportion of students have failed, and the absolute counts of failures are fairly similar in both groups (those who use high amounts of alcohol vs. those who do not). Note, however, that there are far more students in the latter group (268 vs. 114).

3. Alcohol consumption and activities:

t3 <- table(alc$high_use, alc$activities)
t3
##        
##          no yes
##   FALSE 122 146
##   TRUE   59  55
counts <- table(alc$high_use, alc$activities)
barplot(counts, main = "High alcohol use and activities",
        xlab = "Activities",
        legend = rownames(counts), beside = TRUE)

The proportion of students who consume high amounts of alcohol is larger in the group with no extra-curricular activities.

4. Alcohol consumption and the urge to take higher education:

t4 <- table(alc$high_use, alc$higher)
t4
##        
##          no yes
##   FALSE   9 259
##   TRUE    9 105
counts <- table(alc$high_use, alc$higher)
barplot(counts, main = "High alcohol use and the urge to take higher education",
        xlab = "The urge to take higher education",
        legend = rownames(counts), beside = TRUE)

The majority of the students want to take higher education.

Analysis with logistic regression

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
my_model <- glm(formula = high_use ~ age + failures + activities + higher, data = alc, family = "binomial") # Logistic regression model with high_use as the target variable and age, failures, activities and higher education target as explanatory variables.
summary(my_model)
## 
## Call:
## glm(formula = high_use ~ age + failures + activities + higher, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4581  -0.8444  -0.7406   1.3466   1.7584  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    -3.1430     1.8387  -1.709   0.0874 .
## age             0.1528     0.1003   1.523   0.1279  
## failures        0.4397     0.1948   2.257   0.0240 *
## activitiesyes  -0.1820     0.2290  -0.795   0.4266  
## higheryes      -0.2731     0.5435  -0.503   0.6153  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 453.19  on 377  degrees of freedom
## AIC: 463.19
## 
## Number of Fisher Scoring iterations: 4
coef(my_model) # coefficients of the model
##   (Intercept)           age      failures activitiesyes     higheryes 
##    -3.1429502     0.1527828     0.4397280    -0.1820300    -0.2731132
OR <- coef(my_model) %>% exp # computing odds ratios
CI <- confint(my_model) %>% exp # computing confidence intervals
## Waiting for profiling to be done...
cbind(OR,CI) # printing ORs and CIs
##                      OR       2.5 %   97.5 %
## (Intercept)   0.0431553 0.001136953 1.561687
## age           1.1650719 0.957724023 1.420525
## failures      1.5522849 1.059603697 2.289548
## activitiesyes 0.8335763 0.531757595 1.306573
## higheryes     0.7610067 0.262896907 2.277735

The only examined variable that is statistically significantly associated with high use of alcohol is failures. The odds ratio for failures is 1.55 with a confidence interval of 1.06-2.29 (p-value 0.024): each additional past class failure multiplies the odds of high alcohol use by about 1.55.
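The odds ratio is simply the exponentiated logistic regression coefficient, which is easy to verify with the estimates printed above:

```r
# exp() of a logistic regression coefficient gives the odds ratio
b_failures <- 0.4397280   # estimate for failures from summary(my_model) above
exp(b_failures)           # ~1.55, matching the OR table
```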

The data exploration above and the logistic regression model show similar trends for the other variables as well, but those are not statistically significant.

Exploring the predictive power of my model

my_model2 <- glm(high_use ~ failures, data = alc, family = "binomial") # fitting the model with the only variable (failures) that had statistically significant relationship with high use of alcohol
probabilities <- predict(my_model2, type = "response") # predicting the probability of high use
alc <- mutate(alc, probability = probabilities) # adding the predicted probabilities to 'alc'
alc <- mutate(alc, prediction = probability > 0.5) # using the probabilities to make a prediction of high_use
table(high_use = alc$high_use, prediction = alc$prediction) # tabulation of the target variable versus the predictions
##         prediction
## high_use FALSE TRUE
##    FALSE   256   12
##    TRUE    102   12
library(dplyr)
library(ggplot2)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction)) # initializing a plot of 'high_use' versus 'probability' in 'alc'
g + geom_point() # defining the geom as points and drawing the plot

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins() # tabulating the target variable versus the predictions
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67015707 0.03141361 0.70157068
##    TRUE  0.26701571 0.03141361 0.29842932
##    Sum   0.93717277 0.06282723 1.00000000
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong) 
} # defining a loss function (mean prediction error)
loss_func(class = alc$high_use, prob = alc$probability) # calling loss_func to compute the average number of wrong predictions in the (training) data
## [1] 0.2984293

The average proportion of wrong predictions in the (training) data is 29.8 %. The proportion is quite high but still better than random guessing, which would misclassify about half of the students. (Notably, simply predicting FALSE for everyone would be wrong exactly for the 114 high users, giving the same 29.8 % error, so the model does not beat that trivial baseline on the training data.)
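To see concretely what loss_func counts, here is a tiny worked example: a prediction is counted as wrong whenever the probability falls on the wrong side of 0.5.

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5   # TRUE when prob is on the wrong side of 0.5
  mean(n_wrong)                        # share of wrong predictions
}
# True classes 1,0,1,0; the last two probabilities fall on the wrong side:
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.4, 0.6))  # 0.5
# Predicting probability 0 for everyone makes the error equal the share of TRUEs:
loss_func(class = c(1, 0, 1, 0), prob = 0)                      # 0.5
```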


4. Clustering and classification

Exploring the data

This week we are working with the Boston data, a data set included in R (more precisely, in the MASS package). The data deal with housing values in the suburbs of Boston and include 506 observations of 14 variables.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(Boston) # loading the Boston data
dim(Boston) # exploring the dimensions of the data
## [1] 506  14
str(Boston) # exploring the structure of the data
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

List of variables:

crim = per capita crime rate by town

zn = proportion of residential land zoned for lots over 25,000 sq.ft

indus = proportion of non-retail business acres per town

chas = Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

nox = nitrogen oxides concentration (parts per 10 million)

rm = average number of rooms per dwelling

age = proportion of owner-occupied units built prior to 1940

dis = weighted mean of distances to five Boston employment centres

rad = index of accessibility to radial highways

tax = full-value property-tax rate per $10,000

ptratio = pupil-teacher ratio by town

black = 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

lstat = lower status of the population (percent)

medv = median value of owner-occupied homes in $1000s

The sources of the data

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Overview of the data

library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  1.3.4     ✔ purrr   0.2.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ MASS::select()  masks dplyr::select()
cor_matrix<-cor(Boston) %>% round(2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Above, you can see a correlation plot of the data. It seems, for example, that the index of accessibility to radial highways is strongly positively correlated with full-value property-tax rate. There seems to be a clear negative correlation between lower status of the population and median value of owner-occupied homes.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Above, you can see summaries of the variables in the data. To mention a few interesting ones:

- Per capita crime rate by town differs substantially between towns, with a minimum of 0.006 and a maximum of 89.0 (median 0.26).

- Pupil-teacher ratio by town varies between 12.6 and 22.0 (median 19.05).

- Lower status of the population (percent) varies between 1.7 and 38.0 (median 11.4).

Standardizing the data set

boston_scaled <- scale(Boston) # standardizing the data set
summary(boston_scaled) # summary of the scaled data
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

In the scaling, the column means were subtracted from the corresponding columns and the differences were divided by the columns’ standard deviations. In the scaled data you can see that the mean of every variable is now 0. Scaling is needed to put the variables on comparable scales before we perform the linear discriminant analysis.
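A quick check on a small vector confirms what scale() does:

```r
# scale() = subtract the column mean, divide by the column standard deviation
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(x)), manual)   # TRUE
```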

boston_scaled <- as.data.frame(boston_scaled)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low","med_low","med_high","high")) # creating a categorical variable crime with 4 categories (low, med_low, med_high and high)

boston_scaled <- dplyr::select(boston_scaled, -crim) # dropping the old 'crim' variable from the scaled data
boston_scaled <- data.frame(boston_scaled, crime) # adding the new categorical value to scaled data

n <- nrow(boston_scaled)
n # number of rows is 506
## [1] 506
ind <- sample(n,  size = n * 0.8) # choosing randomly 80% of the rows
train <- boston_scaled[ind,] # creating the train set
test <- boston_scaled[-ind,] # creating the test set

Fitting the linear discriminant analysis on the train set

library(MASS)
lda.fit <- lda(formula = crime ~ ., data = train) # fitting the LDA with the categorical crime rate as the target variable and all the other variables in the dataset as predictors
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes)

The plot above shows the observations projected onto the first two linear discriminants, colored by crime category.

Predicting with LDA

crime_categories <- test$crime # saving the crime categories from the test set
test <- dplyr::select(test, -crime) # taking the 'crime' variable away from the test set
lda.pred <- predict(lda.fit, newdata = test) # predicting the classes with the LDA model on the test data
table(correct = crime_categories, predicted = lda.pred$class) # cross tabulating the results of prediction with the crime categories from the test set
##           predicted
## correct    low med_low med_high high
##   low       13       9        2    0
##   med_low    7      10       10    0
##   med_high   2       9       18    0
##   high       0       0        0   22

From the cross tabulation above we can see that the model predicts the crime categories reasonably well; in particular, every observation in the high category is classified correctly in this run.
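“Reasonably well” can be quantified as overall accuracy, i.e. the diagonal of the confusion matrix divided by the total. Re-entering the counts from this run (the exact numbers vary, since the train/test split is random):

```r
# Overall accuracy from the cross tabulation above
conf <- matrix(c(13,  7,  2,  0,   # predicted low
                  9, 10,  9,  0,   # predicted med_low
                  2, 10, 18,  0,   # predicted med_high
                  0,  0,  0, 22),  # predicted high
               nrow = 4)
sum(diag(conf)) / sum(conf)        # ~0.62, against a 0.25 four-class baseline
```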

K-means clustering

library(MASS)
data(Boston) # reloading the Boston data set
boston_scaled <- scale(Boston) # standardizing the data set
dist_eu <- dist(boston_scaled) # calculating the distances between the observations
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Above, the Euclidean distances between the (scaled) observations are calculated and summarized.

km <- kmeans(Boston, centers = 4) # running the k-means algorithm on the dataset, with 4 as the number of clusters
pairs(Boston, col = km$cluster)

This plot matrix is too big and complicated to comprehend as a whole, so let’s take only part of it for closer examination.

pairs(Boston[1:5], col = km$cluster)

Now let’s try to find the optimal number of clusters. One way to determine it is to look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes: when you plot the number of clusters against the total WCSS, the optimal number of clusters is where the total WCSS drops radically.

set.seed(123)
k_max <- 10 # determining the maximum number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss}) # calculating the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line') # visualizing the results

The optimal number of clusters seems to be 2. Let’s run the K-means algorithm again with this number of clusters:

km <- kmeans(Boston, centers = 2)
pairs(Boston, col = km$cluster)

pairs(Boston[1:5], col = km$cluster)